Why Not Just Use Whisper

Running Whisper locally gets you a transcript, but without speaker labels it's a wall of text: you can't tell who said what. For a two-person conversation that's annoying; for a panel discussion it's unusable. Cloud services like Otter.ai handle this, but they cost money per minute and send your audio to someone else's servers.

I wanted something that runs on my own hardware, identifies speakers, and produces output clean enough to read. Not perfect, just good.

Three Models, One Pipeline

The pipeline chains three specialized systems:

NVIDIA Parakeet TDT 0.6B (via NeMo) handles speech-to-text. The TDT variant produces word-level timestamps, which is critical for speaker assignment later.

pyannote-audio 3.1 handles diarization, figuring out who is speaking when. It outputs time segments labeled "Speaker A," "Speaker B," etc., covering roughly 80% of the audio (the gaps from crosstalk, silence, and background noise get labeled "Unidentified").

Ollama handles post-processing: a local LLM takes the raw transcript with speaker labels and cleans up transcription errors, assigns real names when possible, and smooths out artifacts.

Merging Chunks at Boundaries

Long audio files can't be processed in one shot: memory constraints and model limitations force chunking. But naive chunking creates problems at the boundaries. If you split at the 10-minute mark, any word being spoken at that exact moment either gets duplicated (it appears in both chunks) or lost (it appears in neither).

My solution uses a 600-second buffer with a 2-second stride overlap. Each chunk overlaps the next by 2 seconds, and the overlap region is merged using Levenshtein distance matching: find the best alignment between the end of chunk N and the beginning of chunk N+1, then splice.
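A rough sketch of how the chunk boundaries could be computed. It assumes "600-second buffer with 2-second overlap" means each window is 600 seconds long and the next one starts 598 seconds later; the function name and defaults are illustrative, not taken from the actual pipeline.

```python
def chunk_spans(total_seconds: float, window: float = 600.0,
                overlap: float = 2.0) -> list[tuple[float, float]]:
    """Return (start, end) spans in seconds; consecutive spans share `overlap` seconds."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    if total_seconds <= 0:
        return []
    step = window - overlap  # assumed: each new window starts `overlap` seconds early
    spans = []
    start = 0.0
    while True:
        end = min(start + window, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return spans


if __name__ == "__main__":
    # A 25-minute file yields three windows, each overlapping the next by 2 seconds.
    for span in chunk_spans(1500.0):
        print(span)
```

Each span can then be cut out of the normalized audio and transcribed independently; the 2-second shared region is what the merge step works on.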

This took more debugging than anything else in the pipeline. What if the overlap region contains silence? What if the speaker changes mid-overlap? What if NeMo transcribes the same phrase slightly differently in each chunk? The Levenshtein approach handles most of these, but tuning the match threshold required a test suite of deliberately adversarial audio samples. Too aggressive and distinct phrases get collapsed together; too conservative and words get duplicated.
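To make the splice concrete, here is a simplified sketch of overlap merging with Levenshtein distance. It is not the pipeline's actual code: it compares the last k words of one chunk against the first k words of the next for each candidate k, and the names `merge_chunks`, `max_overlap_words`, and `threshold` (with its 0.25 default) are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def merge_chunks(prev_words: list[str], next_words: list[str],
                 max_overlap_words: int = 12, threshold: float = 0.25) -> list[str]:
    """Splice two word lists, dropping the duplicated overlap from the second."""
    best_k, best_score = 0, threshold
    limit = min(max_overlap_words, len(prev_words), len(next_words))
    for k in range(1, limit + 1):
        tail = " ".join(prev_words[-k:]).lower()
        head = " ".join(next_words[:k]).lower()
        # Normalized distance: 0.0 is identical, 1.0 is completely different.
        score = levenshtein(tail, head) / max(len(tail), len(head), 1)
        if score <= best_score:  # "<=" prefers the longer overlap on ties
            best_k, best_score = k, score
    # best_k == 0 means nothing matched within the threshold: plain concatenation.
    return prev_words + next_words[best_k:]


if __name__ == "__main__":
    a = "we need to consider the latency budget".split()
    b = "the latancy budget before we ship".split()  # same phrase, transcribed differently
    print(" ".join(merge_chunks(a, b)))
```

The `threshold` parameter is where the trade-off described above shows up: raise it and unrelated phrases start to match, lower it and slightly different transcriptions of the same words stop matching and get duplicated.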

Speaker Assignment

Once I have word-level timestamps from NeMo and speaker segments from pyannote, assignment is a lookup: I take each word's midpoint and find the speaker segment that contains it. The midpoint falls into exactly one speaker segment or into a gap, which maps to "Unidentified."
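A minimal sketch of the midpoint lookup, assuming the diarization segments do not overlap each other. The `Word` and `Segment` types and the `assign_speakers` name are hypothetical stand-ins, not NeMo or pyannote classes.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds
    end: float


@dataclass(frozen=True)
class Segment:
    speaker: str
    start: float
    end: float


def assign_speakers(words: list[Word], segments: list[Segment]) -> list[tuple[str, Word]]:
    """Label each word with the speaker whose segment contains the word's midpoint."""
    ordered = sorted(segments, key=lambda s: s.start)
    starts = [s.start for s in ordered]
    labeled = []
    for word in words:
        mid = (word.start + word.end) / 2
        i = bisect_right(starts, mid) - 1  # last segment starting at or before the midpoint
        if i >= 0 and mid < ordered[i].end:
            labeled.append((ordered[i].speaker, word))
        else:
            labeled.append(("Unidentified", word))  # midpoint fell into a gap
    return labeled


if __name__ == "__main__":
    segs = [Segment("Speaker A", 0.0, 2.0), Segment("Speaker B", 2.5, 4.0)]
    words = [Word("hi", 0.2, 0.6), Word("there", 1.9, 2.3), Word("yes", 2.4, 3.0)]
    for speaker, word in assign_speakers(words, segs):
        print(speaker, word.text)
```

Using the midpoint rather than the start time means a word that straddles a segment boundary goes to whichever side holds more of it.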

One NeMo quirk cost me an afternoon: the word timestamps live in result.timestamp['word'], not in a timestamps attribute. The documentation doesn't make this obvious.
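A small helper along these lines keeps that lookup in one place. Treat it as a sketch: the only part taken from the text above is `result.timestamp['word']`; the per-entry keys `word`, `start`, and `end` are my assumption about the entry layout and may differ between NeMo versions, and the example uses a stand-in object rather than a real transcription result.

```python
from types import SimpleNamespace


def extract_words(result) -> list[tuple[str, float, float]]:
    """Return (text, start_seconds, end_seconds) tuples from a NeMo-style result.

    Assumes each entry in result.timestamp["word"] is a dict with
    "word", "start", and "end" keys (an assumption, check your NeMo version).
    """
    entries = result.timestamp["word"]
    return [(e["word"], float(e["start"]), float(e["end"])) for e in entries]


if __name__ == "__main__":
    # Stand-in for illustration only; a real result would come from the ASR model.
    fake_result = SimpleNamespace(
        timestamp={"word": [{"word": "hello", "start": 0.0, "end": 0.4}]}
    )
    print(extract_words(fake_result))
```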

The labeled output looks like this:

Speaker A [00:01:23-00:01:45]: I think the main issue with current approaches is...
Speaker B [00:01:45-00:02:03]: Right, and that's exactly why we need to consider...
Unidentified [00:02:03-00:02:05]: [crosstalk]
Speaker A [00:02:05-00:02:30]: Sorry, go ahead.
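Going from per-word labels to lines in that format is a matter of collapsing consecutive words from the same speaker into one turn. A sketch, assuming the input is a list of `(speaker, text, start, end)` tuples; the helper names are illustrative.

```python
def hms(seconds: float) -> str:
    """Format seconds as HH:MM:SS, truncating fractions."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_turns(labeled_words: list[tuple[str, str, float, float]]) -> list[str]:
    """Merge consecutive same-speaker words into 'Speaker [start-end]: text' lines."""
    turns = []  # each entry: [speaker, start, end, list_of_word_texts]
    for speaker, text, start, end in labeled_words:
        if turns and turns[-1][0] == speaker:
            turns[-1][2] = end          # extend the current turn
            turns[-1][3].append(text)
        else:
            turns.append([speaker, start, end, [text]])
    return [f"{spk} [{hms(s)}-{hms(e)}]: {' '.join(ws)}" for spk, s, e, ws in turns]


if __name__ == "__main__":
    words = [
        ("Speaker A", "Sorry,", 125.0, 125.4),
        ("Speaker A", "go", 125.5, 125.7),
        ("Speaker A", "ahead.", 125.8, 126.2),
        ("Speaker B", "Thanks.", 150.0, 150.5),
    ]
    print("\n".join(format_turns(words)))
```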

Preprocessing

Raw podcast audio needs normalization before anything else. Volume levels vary between speakers, some recordings have background music, and sample rates are inconsistent:

ffmpeg -i input.mp3 -af dynaudnorm -ar 16000 -ac 1 output.wav

Dynamic audio normalization (dynaudnorm) evens out volume differences without clipping. Resampling to 16 kHz mono matches the input that both NeMo and pyannote expect. This single command eliminates a whole category of downstream errors.
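If the pipeline is driven from Python, the same command can be wrapped so that every input file goes through it. A minimal sketch; the function names are hypothetical, and the flags are exactly the ones in the command above.

```python
import subprocess


def ffmpeg_normalize_cmd(src: str, dst: str) -> list[str]:
    """Build the ffmpeg argument list: dynaudnorm, 16 kHz sample rate, mono."""
    return ["ffmpeg", "-i", str(src),
            "-af", "dynaudnorm",  # dynamic audio normalization
            "-ar", "16000",       # resample to 16 kHz
            "-ac", "1",           # downmix to one channel
            str(dst)]


def normalize(src: str, dst: str) -> None:
    """Run ffmpeg and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(ffmpeg_normalize_cmd(src, dst), check=True)


if __name__ == "__main__":
    print(" ".join(ffmpeg_normalize_cmd("input.mp3", "output.wav")))
```

Passing the arguments as a list instead of a shell string avoids quoting problems with filenames that contain spaces.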

LLM Cleanup

The raw pipeline output is functional but rough. The Ollama post-processing step takes the full transcript and does three things: infers speaker names from context (if Speaker A says "Thanks for having me on the show, Sarah," then Speaker B is probably Sarah; a 7B model handles this well), fixes obvious transcription errors where context makes the intended word clear, and adds paragraph breaks at topic transitions.

Keeping this step local (rather than calling GPT-4) means the entire pipeline stays offline. A 7B parameter model is more than enough for "their" vs. "there."
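A sketch of what the cleanup call could look like against Ollama's local HTTP API. The endpoint and the `model`, `prompt`, and `stream` fields follow Ollama's generate API; the prompt wording, the function names, and the model name in the example are placeholders I made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Illustrative prompt covering the three cleanup tasks; not the pipeline's actual wording.
PROMPT = (
    "You are cleaning up a diarized transcript.\n"
    "1. Replace generic speaker labels with real names when context makes them clear.\n"
    "2. Fix obvious transcription errors where the intended word is unambiguous.\n"
    "3. Add paragraph breaks at topic transitions.\n"
    "Return only the cleaned transcript.\n\n"
)


def build_request(transcript: str, model: str) -> dict:
    """Build the JSON payload for a single non-streaming generate call."""
    return {"model": model, "prompt": PROMPT + transcript, "stream": False}


def clean_transcript(transcript: str, model: str) -> str:
    """Send the transcript to a local Ollama server and return the model's text."""
    data = json.dumps(build_request(transcript, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    # "some-7b-model" is a placeholder; substitute whichever model you have pulled locally.
    payload = build_request("Speaker A [00:00:01-00:00:04]: Thanks for having me.", "some-7b-model")
    print(json.dumps(payload, indent=2))
```

Because the request only ever goes to localhost, nothing in this step leaves the machine.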

Reflections

The chunk merging ate most of my debugging time; the core idea is straightforward but edge cases multiply fast. Diarization is the weakest link: pyannote's roughly 80% coverage means about one in five seconds of audio has no speaker label, and it struggles with crosstalk and with speakers who have similar voices. Good enough for my purposes, but it's the component I'd replace first if something better comes along.

The pipeline takes roughly 1.5× real-time on a consumer GPU, which is fast enough to be practical.